256 research outputs found
Generative Invertible Networks (GIN): Pathophysiology-Interpretable Feature Mapping and Virtual Patient Generation
Machine learning methods play increasingly important roles in pre-procedural
planning for complex surgeries and interventions. Very often, however,
researchers find that historical records of emerging surgical techniques, such
as transcatheter aortic valve replacement (TAVR), are scarce.
In this paper, we address this challenge by proposing novel
generative invertible networks (GIN) to select features and generate
high-quality virtual patients that may potentially serve as an additional data
source for machine learning. Combining a convolutional neural network (CNN) and
generative adversarial networks (GAN), GIN discovers the pathophysiologic
meaning of the feature space. Moreover, a test of predicting the surgical
outcome directly using the selected features results in a high accuracy of
81.55%, which suggests little pathophysiologic information has been lost while
conducting the feature selection. This demonstrates that GIN can generate
virtual patients that are not only visually authentic but also
pathophysiologically interpretable.
Expanding the Understanding of Biases in Development of Clinical-Grade Molecular Signatures: A Case Study in Acute Respiratory Viral Infections
The promise of modern personalized medicine is to use molecular and clinical information to better diagnose, manage, and treat disease on an individual patient basis. These functions are predominantly enabled by molecular signatures, which are computational models for predicting phenotypes and other responses of interest from high-throughput assay data. Data analytics is a central component of molecular signature development and can jeopardize the entire process if conducted incorrectly. While exploratory data analysis may tolerate suboptimal protocols, clinical-grade molecular signatures are subject to vastly stricter requirements. Closing the gap between standards for exploratory versus clinically successful molecular signatures entails a thorough understanding of possible biases in the data analysis phase and developing strategies to avoid them. Using a recently introduced data-analytic protocol as a case study, we provide an in-depth examination of the poorly studied biases of data-analytic protocols related to signature multiplicity, biomarker redundancy, data preprocessing, and validation of signature reproducibility. The methodology and results presented in this work are aimed at expanding the understanding of these data-analytic biases that affect development of clinically robust molecular signatures. Several recommendations follow from the current study. First, all molecular signatures of a phenotype should be extracted to the extent possible, in order to provide comprehensive and accurate grounds for understanding disease pathogenesis. Second, redundant genes should generally be removed from final signatures to facilitate reproducibility and decrease manufacturing costs. Third, data preprocessing procedures should be designed so as not to bias biomarker selection. Finally, molecular signatures developed and applied on different phenotypes and populations of patients should be treated with great caution.
Automated Discrimination of Pathological Regions in Tissue Images: Unsupervised Clustering vs Supervised SVM Classification
Recognizing and isolating cancerous cells from non-pathological tissue areas (e.g. connective stroma) is crucial for fast and objective immunohistochemical analysis of tissue images. This operation allows the further application of fully automated techniques for quantitative evaluation of protein activity, since it removes the need for a prior manual selection of the representative pathological areas in the image, as well as the need to take pictures only in the purely cancerous portions of the tissue. In this paper we present a fully automated method based on unsupervised clustering that performs tissue segmentations highly comparable with those provided by a skilled operator, achieving on average an accuracy of 90%. Experimental results on a heterogeneous dataset of immunohistochemical lung cancer tissue images demonstrate that our proposed unsupervised approach exceeds the accuracy of a theoretically superior supervised method, the Support Vector Machine (SVM), by 8%.
Adjusted Measures for Feature Selection Stability for Data Sets with Similar Features
For data sets with similar features, for example highly correlated features,
most existing stability measures behave in an undesired way: They consider
features that are almost identical but have different identifiers as different
features. Existing adjusted stability measures, that is, stability measures
that take into account the similarities between features, have major
theoretical drawbacks. We introduce new adjusted stability measures that
overcome these drawbacks. We compare them to each other and to existing
stability measures based on both artificial and real sets of selected features.
Based on the results, we suggest using one new stability measure that considers
highly similar features as exchangeable.
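The behaviour criticised in this abstract can be illustrated with a small sketch. It uses the mean pairwise Jaccard index as a stand-in for an unadjusted stability measure; this is an assumption chosen for illustration, not one of the measures proposed in the paper. The feature names and the `representative` mapping are hypothetical, and mapping near-duplicates to a single identifier is only a crude way to mimic treating them as exchangeable.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two feature sets (defined as 1.0 when both are empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def unadjusted_stability(feature_sets):
    """Mean pairwise Jaccard index over all pairs of selected-feature sets."""
    pairs = list(combinations(feature_sets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical selections from two runs; "x1" and "x1_copy" are assumed to be
# almost identical features that merely carry different identifiers.
runs = [{"x1", "x2"}, {"x1_copy", "x2"}]
score = unadjusted_stability(runs)

# Crude illustration of exchangeability: map each near-duplicate to one
# representative identifier before comparing the selections.
representative = {"x1_copy": "x1"}
merged = [{representative.get(f, f) for f in run} for run in runs]
merged_score = unadjusted_stability(merged)

print(score, merged_score)
```

With identifiers compared literally, the two runs look unstable even though they select practically the same information; once the near-duplicates share an identifier, the selections coincide.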
Automated segmentation of tissue images for computerized IHC analysis
This paper presents two automated methods for the segmentation of immunohistochemical tissue images that overcome the limitations of the manual approach as well as of the existing computerized techniques. The first independent method, based on unsupervised color clustering, automatically recognizes the target cancerous areas in the specimen and disregards the stroma; the second method, based on color separation and morphological processing, performs automated segmentation of the nuclear membranes of the cancerous cells. Extensive experimental results on real tissue images demonstrate the accuracy of our techniques compared to manual segmentations; additional experiments show that our techniques are more effective in immunohistochemical images than popular approaches based on supervised learning or active contours. The proposed procedure can be exploited for any application that requires exploration of tissues and cells, and to perform reliable and standardized measures of the activity of specific proteins involved in multi-factorial genetic pathologies.
A comparison of random forests, boosting and support vector machines for genomic selection
Genomic selection (GS) involves estimating breeding values using molecular markers spanning the entire genome. Accurate prediction of genomic estimated breeding values (GEBVs) presents a central challenge to contemporary plant and animal breeders. The existence of a wide array of marker-based approaches for predicting breeding values makes it essential to evaluate and compare their relative predictive performances to identify approaches able to accurately predict breeding values. We evaluated the predictive accuracy of random forests (RF), stochastic gradient boosting (boosting) and support vector machines (SVMs) for predicting genomic breeding values using dense SNP markers, and explored the utility of RF for ranking the predictive importance of markers, either for pre-screening markers or for discovering chromosomal locations of QTLs.
Deep Learning and Random Forest-Based Augmentation of sRNA Expression Profiles
The lack of well-structured annotations in a growing amount of RNA expression
data complicates data interoperability and reusability. Commonly used text
mining methods extract annotations from existing unstructured data descriptions
and often provide inaccurate output that requires manual curation. Automatic
data-based augmentation (generation of annotations on the basis of expression
data) can considerably improve the annotation quality but has not been
well studied. We formulate an automatic augmentation of small RNA-seq
expression data as a classification problem and investigate deep learning (DL)
and random forest (RF) approaches to solve it. We generate tissue and sex
annotations from small RNA-seq expression data for tissues and cell lines of
Homo sapiens. We validate our approach on 4243 annotated small RNA-seq samples
from the Small RNA Expression Atlas (SEA) database. The average prediction
accuracy (DL) is 98% for tissue groups, 96.5% for tissues, and 77% for sex.
The "one dataset out" average accuracy for tissue group prediction is
83% (DL) and 59% (RF). On average, DL provides better results than
RF and considerably improves classification performance for 'unseen' datasets.
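The "one dataset out" evaluation mentioned in this abstract can be read as a leave-one-group-out scheme: every sample from one dataset is held out, a model is trained on the rest, and the per-dataset accuracies are averaged. The sketch below follows that reading, which is an assumption about the protocol rather than the authors' code. The dataset identifiers, the labels, and the majority-class predictor standing in for the DL and RF models are all hypothetical.

```python
from collections import Counter


def leave_one_dataset_out(dataset_ids):
    """Yield (held_out_id, train_indices, test_indices), one fold per dataset."""
    for held_out in sorted(set(dataset_ids)):
        train = [i for i, d in enumerate(dataset_ids) if d != held_out]
        test = [i for i, d in enumerate(dataset_ids) if d == held_out]
        yield held_out, train, test


def majority_label(labels):
    """Placeholder 'model': predict the most frequent training label."""
    return Counter(labels).most_common(1)[0][0]


# Hypothetical toy data: one dataset identifier and one tissue label per sample.
dataset_ids = ["A", "A", "B", "B", "C"]
labels = ["brain", "brain", "brain", "liver", "brain"]

fold_accuracies = []
for held_out, train, test in leave_one_dataset_out(dataset_ids):
    guess = majority_label([labels[i] for i in train])
    correct = sum(labels[i] == guess for i in test)
    fold_accuracies.append(correct / len(test))

# Average over datasets, so each held-out dataset counts once regardless of size.
mean_accuracy = sum(fold_accuracies) / len(fold_accuracies)
print(fold_accuracies, mean_accuracy)
```

Because whole datasets are held out, no sample from the test dataset is seen during training, which is what makes this a check on 'unseen' datasets rather than on unseen samples.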
Using gene expression profiles from peripheral blood to identify asymptomatic responses to acute respiratory viral infections
Background: A recent study reported that gene expression profiles from peripheral blood samples of healthy subjects prior to viral inoculation were indistinguishable from profiles of subjects who received viral challenge but remained asymptomatic and uninfected. If true, this implies that the host immune response does not have a molecular signature. Given the high sensitivity of microarray technology, we were intrigued by this result and hypothesized that it was an artifact of data analysis. Findings: Using acute respiratory viral challenge microarray data, we developed a molecular signature that for the first time allowed for an accurate differentiation between uninfected subjects prior to viral inoculation and subjects who remained asymptomatic after the viral challenge. Conclusions: Our findings suggest that molecular signatures can be used to characterize immune responses to viruses and may improve our understanding of susceptibility to viral infection, with possible implications for vaccine development.
Factors Influencing the Statistical Power of Complex Data Analysis Protocols for Molecular Signature Development from Microarray Data
Critical to the development of molecular signatures from microarray and other high-throughput data is testing the statistical significance of the produced signature in order to ensure its statistical reproducibility. While current best practices emphasize sufficiently powered univariate tests of differential expression, little is known about the factors that affect the statistical power of complex multivariate analysis protocols for high-dimensional molecular signature development. We show that choices of specific components of the analysis (i.e., error metric, classifier, error estimator and event balancing) have large and compounding effects on statistical power. The effects are demonstrated empirically by an analysis of 7 of the largest microarray cancer outcome prediction datasets and supplementary simulations, and by contrasting them to prior analyses of the same data. The findings of the present study have two important practical implications. First, by avoiding under-powered data analysis protocols, high-throughput studies can achieve substantial economies in the sample size required to demonstrate statistical significance of predictive signal. Factors that affect power are identified and studied. Much smaller samples than previously thought may be sufficient for exploratory studies as long as these factors are taken into consideration when designing and executing the analysis. Second, previous highly cited claims that microarray assays may not be able to predict disease outcomes better than chance are shown by our experiments to be due to under-powered data analysis combined with inappropriate statistical tests.